ABSTRACT

Automatic language identification is the process by which
the language of a digitized speech utterance is recognized
by a computer. In this paper, we will describe the set of
available cues for language identification and discuss the
different approaches to building working systems. This
overview includes a range of historic approaches,
contemporary systems that have been evaluated on standard
databases, as well as promising future approaches.
Comparative results are also reported.

1. INTRODUCTION

Automatic language identification is the process by which
the language of a digitized speech utterance is recognized
by a computer. It is one of several processes in which
information is extracted automatically from a speech signal.

Language-ID (LID) applications fall into two main
categories: preprocessing for machine systems and
preprocessing for human listeners. Figure 1 shows a hotel
lobby or international airport of the future that employs a
multi-lingual voice-controlled travel information retrieval
system. If no mode of input other than speech is used, then
the system must be capable of determining the language of
the speech commands either while it is recognizing the
commands or before it has recognized the commands.
Determining the language during recognition would require
many speech recognizers (one for each language) running in
parallel. Because tens or even hundreds of input languages
would need to be supported, the cost of the required
real-time hardware might prove prohibitive. Alternatively,
a language-ID system could be run in advance of the speech
recognizer. In this case, the language-ID system would
quickly list the most likely languages of the speech
commands, after which the few most appropriate
language-dependent speech-recognition models could be
loaded and run on the available hardware. A final
language-ID determination would be made only after speech
recognition was complete.

Figure 2 illustrates an example of the second category of
LID applications: preprocessing for human listeners. In
this case, LID is used to route an incoming telephone call
to a human switchboard operator fluent in the corresponding
language. Such scenarios are already occurring today: for
example, AT&T offers a Language Line interpreter service
to, among others, police departments handling emergency
calls. When a caller to Language Line does not speak
English, a human operator must attempt to route the call to
an appropriate interpreter. Much of the process is trial
and error (for example, recordings of greetings in various
languages can be used) and can require connections to
several human interpreters before the appropriate person is
found. As reported by Muthusamy et al. [33], when callers
to Language Line do not speak English, the delay in finding
a suitable interpreter can be on the order of minutes,
which could prove devastating in an emergency. Thus, a LID
system that could quickly determine the most likely
languages of the incoming speech might be used to reduce
the time required to find an appropriate interpreter by one
or two orders of magnitude.

2. LANGUAGE IDENTIFICATION CUES

Humans and machines can use a variety of cues to
distinguish one language from another. The reader is
referred to the linguistics literature (e.g., [5, 6, 12])
for in-depth discussions of how specific languages differ
from one another and to Muthusamy et al. [35], who have
measured how well humans can perform language ID. In
summary, the following characteristics differ from language
to language:

• Phonology. A “phoneme” is an underlying mental
representation of a phonological unit in a language. For
example, the eight phonemes that comprise the word
“celebrate” are /s eh l ix b r ey t/. A “phone” is a
realization of an acoustic-phonetic unit or segment. It is
the actual sound produced when a speaker is thinking of
speaking a phoneme. The phones that comprise the word
“celebrate” might be [s eh l ax bcl b r ey q]. As
documented by linguists, phone and phoneme sets differ from
one language to another, even though many languages share a
common subset of phones/phonemes. Phone/phoneme frequencies
of occurrence may also differ, i.e., a phone may occur in
two languages, but it may be more frequent in one language
than the other. Phonotactics, i.e., the rules governing the
sequences of allowable phones/phonemes, can also be
different.

• Morphology. The word roots and lexicons are usually
different from language to language. Each language has its
own vocabulary, and its own manner of forming words.

• Syntax. The sentence patterns are different among
languages. Even when two languages share a word, e.g., the
word “bin” in English and German, the sets of words that
may precede and follow the word will be different.

• Prosody. Duration characteristics, pitch contours, and
stress patterns are different from one language to another.

Figure 1: A language-identification (LID) system as a front
end to a set of real-time speech recognizers. The LID
system outputs its three best guesses of the language of
the spoken message (in this case, German, Dutch, and
English). Speech recognizers are loaded with models for
these three languages and make the final LID decision (in
this case, Dutch) after decoding the speech utterance.

Figure 2: A language-identification (LID) system as a front
end to a multi-lingual group of directory-assistance or
emergency operators. The LID system routes an incoming call
to a switchboard operator fluent in the corresponding
language.


3. LANGUAGE IDENTIFICATION SYSTEMS

Research in automatic language identification from speech
has a history extending back to the 1970s. A few
representative LID systems are described below. The reader
will find references to other LID systems in reviews by
Muthusamy et al. [33] and Zissman [50].

Figure 3 shows the two phases of LID. During the “training”
phase, the typical system is presented with examples of
speech from a variety of languages. Each training speech
utterance is converted into a stream of feature vectors.
These feature vectors are computed from short windows of
the speech waveform (e.g., 20 ms) during which the speech
signal is assumed to be somewhat stationary. The feature
vectors are recomputed regularly (e.g., every 10 ms) and
contain spectral or cepstral information about the speech
signal (the cepstrum is the inverse Fourier transform of
the log magnitude spectrum; it is used in many speech
processing applications). The training algorithm analyzes a
sequence of such vectors and produces one or more models
for each language. These models are intended to represent a
set of language-dependent, fundamental characteristics of
the training speech to be used during the next phase of the
LID process.
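The front end described above can be sketched in a few lines
of Python. This is a minimal illustration, not the front end
of any system cited here: the function names, the Hamming
window, the 8 kHz sampling rate, and the choice of 12 cepstral
coefficients are all assumptions made for the example. Only
the 20 ms window, the 10 ms step, and the definition of the
cepstrum come from the text.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=20.0, hop_ms=10.0):
    """Split a 1-D waveform into overlapping windows, one row per frame."""
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(signal) - win) // hop
    # Row i holds samples [i*hop, i*hop + win).
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def real_cepstrum(frames, n_coeffs=12, eps=1e-10):
    """Inverse Fourier transform of the log magnitude spectrum of each frame."""
    windowed = frames * np.hamming(frames.shape[1])   # taper (assumed choice)
    log_mag = np.log(np.abs(np.fft.rfft(windowed, axis=1)) + eps)
    cepstra = np.fft.irfft(log_mag, axis=1)
    # Drop coefficient 0 (overall level) and keep the next n_coeffs.
    return cepstra[:, 1:n_coeffs + 1]

# Example: one second of a synthetic 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2.0 * np.pi * 200.0 * t)
frames = frame_signal(signal, sample_rate)   # 160-sample windows, 80-sample step
features = real_cepstrum(frames)             # one feature vector per frame
```

Each row of `features` plays the role of one feature vector in
the stream that the training algorithm would analyze.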

During the “recognition” phase of LID, feature vectors
computed from a new utterance are compared to each of the
language-dependent models. The likelihood that the new
utterance was spoken in the same language as the speech
used to train each model is computed and the
maximum-likelihood model is found. The language of the
speech that was used to train the model yielding maximum
likelihood is hypothesized as the language of the
utterance.
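The maximum-likelihood decision just described can be written
compactly. The notation is introduced here for illustration
and is not taken from the original: x_1, ..., x_T are the
feature vectors of the new utterance, \lambda_\ell is the
model trained for language \ell, and L is the number of
candidate languages. The second line holds only for
classifiers that treat the frames as independent, which is an
assumption of the sketch rather than a property of every
system discussed below.

```latex
\hat{\ell} = \arg\max_{1 \le \ell \le L} \; \log p(x_1, \ldots, x_T \mid \lambda_\ell)

\log p(x_1, \ldots, x_T \mid \lambda_\ell) = \sum_{t=1}^{T} \log p(x_t \mid \lambda_\ell)
\quad \text{(frame-independence assumption)}
```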

The key issue becomes that of modeling the languages. We
will discuss a series of different features that have been
extracted from speech, yielding increasing amounts of
knowledge at the cost of rendering the language
identification system more and more complex. Some systems
require only the digitized speech utterances and the
corresponding true identities of the languages being spoken
because the language models are based simply on the signal
representation or on a self-generated token representation.
More complicated LID systems use phonemes to model speech
and may require either (1) a phonetic transcription
(sequence of symbols representing the spoken sounds), or
(2) an orthographic transcription (the text of the words
spoken) along with a phonemic transcription dictionary
(mapping of words to prototypical pronunciation) for each
training utterance. Producing these transcriptions and
dictionaries is an expensive, time-consuming process that
usually requires a skilled linguist fluent in the language
of interest.


3.1. Spectral-Similarity Approaches

In the earliest automatic language ID systems, developers
capitalized on the differences in spectral content among
languages, exploiting the fact that speech spoken in
different languages contains different phonemes and phones.
To train these systems, a set of prototypical short-term
spectra were computed and extracted from training speech
utterances. During recognition, test speech spectra were
computed and compared to the training prototypes. The
language of the test speech was hypothesized as the
language having training spectra that best matched the test
spectra.

There were several variations on this spectral-similarity
theme. The training and testing spectra could be used
directly as feature vectors, or they could be used instead
to compute formant-based or cepstral feature vectors. The
training exemplars could be chosen either directly from the
training speech or could be synthesized through the use of
K-means clustering. The spectral similarity could be
calculated using the Euclidean, Mahalanobis, or some other
distance metric. Examples of spectral-similarity LID
systems are those proposed and developed by Cimarusti [4],
Foil [11], Goodman [13], and Sugiyama [45].
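One way to picture the prototype-matching idea is the
following sketch. It is a generic illustration under stated
assumptions, not a reconstruction of any cited system:
exemplars are synthesized with a plain K-means loop, the
distance is Euclidean, each test vector is scored against its
closest exemplar, and the distances are summed per language.
The function names, the language labels "A" and "B", and the
synthetic data are hypothetical.

```python
import numpy as np

def train_codebook(vectors, k, n_iter=20, seed=0):
    """K-means: synthesize k exemplars from one language's training vectors."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[nearest == j]
            if len(members) > 0:          # leave an empty cluster's exemplar as is
                codebook[j] = members.mean(axis=0)
    return codebook

def accumulated_distance(test_vectors, codebook):
    """Sum over test vectors of the Euclidean distance to the closest exemplar."""
    dists = np.linalg.norm(test_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).sum()

def identify(test_vectors, codebooks):
    """Return the language whose codebook gives the lowest accumulated distance."""
    return min(codebooks,
               key=lambda lang: accumulated_distance(test_vectors, codebooks[lang]))

# Toy data: two "languages" whose feature vectors cluster in different places.
rng = np.random.default_rng(1)
codebooks = {
    "A": train_codebook(rng.normal(0.0, 1.0, size=(200, 4)), k=8),
    "B": train_codebook(rng.normal(5.0, 1.0, size=(200, 4)), k=8),
}
test_vectors = rng.normal(5.0, 1.0, size=(50, 4))
hypothesis = identify(test_vectors, codebooks)
```

Swapping the Euclidean norm for a Mahalanobis distance would
only change the two lines that compute `dists`.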

To compute the similarity between a test utterance and a
training model, most of the early spectral-similarity
systems calculated the distance between each test utterance
vector and each training exemplar. The distance between
each test vector and its closest exemplar was accumulated
as an overall distance, and the language model having
lowest overall distance was found. In a generalization of
this vector quantization approach to LID, Riek [40],
Nakagawa [37] and Zissman [49] applied Gaussian mixture
classifiers to language identification. Here, each feature
vector is assumed to be drawn randomly according to a
probability density that is a weighted sum of multivariate
Gaussian densities. During training, a Gaussian mixture
model for the spectral or cepstral feature vectors is
created for each language. During recognition, the
likelihood of the test utterance feature vectors is
computed given each of the training models. The language of
the model having maximum likelihood is hypothesized. The
Gaussian mixture approach is “soft” vector quantization,
where more than one exemplar created during training
impacts the scoring of each test vector.
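The Gaussian mixture scoring step can be sketched as follows.
This is a minimal illustration that assumes diagonal
covariance matrices, frame-independent scoring, and mixture
parameters that have already been estimated (for example by
the EM algorithm, which is not shown). The function names, the
language labels, and the hand-set parameters are hypothetical
and do not come from the systems cited above.

```python
import numpy as np

def gmm_log_likelihood(vectors, weights, means, variances):
    """Total log-likelihood of feature vectors under a diagonal-covariance GMM.

    vectors:   (T, D) feature vectors of the test utterance
    weights:   (M,)   mixture weights summing to 1
    means:     (M, D) component means
    variances: (M, D) per-dimension component variances
    """
    diff = vectors[:, None, :] - means[None, :, :]               # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    weighted = log_gauss + np.log(weights)                       # (T, M)
    # Log-sum-exp over components: every component contributes to each
    # frame's score, which is the "soft" quantization noted in the text.
    peak = weighted.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(weighted - peak).sum(axis=1))
    return frame_ll.sum()

def identify(test_vectors, models):
    """Return the language whose GMM gives the highest log-likelihood."""
    return max(models,
               key=lambda lang: gmm_log_likelihood(test_vectors, *models[lang]))

# Two hand-set two-component models in a 2-D feature space.
models = {
    "language_a": (np.array([0.5, 0.5]),
                   np.array([[0.0, 0.0], [1.0, 1.0]]),
                   np.ones((2, 2))),
    "language_b": (np.array([0.5, 0.5]),
                   np.array([[4.0, 4.0], [5.0, 5.0]]),
                   np.ones((2, 2))),
}
rng = np.random.default_rng(0)
test_vectors = rng.normal(loc=4.5, scale=1.0, size=(100, 2))
hypothesis = identify(test_vectors, models)
```

Replacing the log-sum-exp with a maximum over components would
give a hard, nearest-exemplar style of scoring instead.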

Whereas the language identification systems described above
perform primarily static classification, hidden Markov
models (HMMs) [38], which have the ability to model
sequential characteristics of speech production, have also
been applied to LID. HMM-based language identification was
first proposed by House and Neuburg [17]. Savic [41], Riek
[40], Nakagawa [37], and Zissman [49] all applied HMMs to
spectral and cepstral feature vectors. In these systems,
HMM training was performed on unlabeled training speech.
Riek and Zissman found that HMM systems trained in this
unsupervised manner did not perform as well as some of the
static classifiers that each had been testing, though
Nakagawa eventually obtained better performance for his HMM
approach than his static approaches [36].

Figure 3: The two phases of language identification. During
training, speech waveforms are analyzed and
language-dependent models are produced. During recognition,
a new speech utterance is processed and compared to the
models produced during training. The language of the speech
utterance is hypothesized.

Li [26] has proposed the use of novel features for
spectral-similarity LID. In his system, the syllable nuclei
(i.e., vowels) for each speech utterance are located
automatically. Next, feature vectors containing spectral
information are computed for regions near the syllable
nuclei. Each of these vectors is composed of spectral
sub-vectors computed on neighboring (but not necessarily
adjacent) frames of speech data. Rather than collecting and
modeling these vectors over all training speech, Li keeps
separate collections of feature vectors for each training
speaker. During testing, syllable nuclei of the test
utterance are located and feature vector extraction is
performed. Each speaker-dependent set of training feature
vectors is compared to the feature vectors of the test
utterance, and the most similar speaker-dependent set of
training vectors is found. The language of the speech
spoken by the speaker of that set of training vectors is
hypothesized as the language of the test utterance.

3.2. Prosody-based Approaches

Features that carry prosodic information have also been
used as input to automatic language identification systems.
This has been motivated, in part, by studies showing that
humans can use prosodic features for identifying the
language of speech utterances [35, 31]. For example,
Itahashi has built systems that use features based on pitch
estimates alone [18, 19]. He argues that pitch estimation
is more robust in noisy environments than spectral
parameters.

Hazen [14], however, showed that features derived from
prosodic information provided little language
discriminability when compared to a phonetic system. A
system that used both prosodic and phonetic parameters
performed about the same as a system using phonetic
parameters alone.


Finally, Thyme-Gobbel et al. [47] have also looked at the
utility of prosodic cues for language identification.
Parameters were designed to capture pitch and amplitude
contours on a syllable-by-syllable basis. They were
normalized to be insensitive to overall amplitude, pitch
and speaking rate. Results show that prosodic parameters
can be useful for discriminating one language from another;
however, the accuracy of any particular set of features is
highly language-pair specific.

3.3. Phone-Recognition Approaches

Given that different languages have different phone
inventories, many researchers have built LID systems that
hypothesize exactly which phones are being spoken as a
function of time and determine the language based on the
statistics of that phone sequence. For example, Lamel built
two HMM-based phone recognizers: one in English and another
in French [25]. These phone recognizers were then run over
test data spoken either in English or French. Lamel et al.
found that the likelihood scores emanating from
language-dependent phone recognizers can be used to
discriminate between English and French speech. Muthusamy
et al. ran a similar system on English vs. Japanese
spontaneous telephone speech [32].

The novelty of these phone-based systems was the
incorporation of more knowledge into the LID system. Both
Lamel et al. and Muthusamy et al. trained their systems
with multi-language phonetically labeled corpora. Because
the systems require phonetically labeled training speech
utterances in each language, as compared to the
spectral-similarity systems which do not require such
labels, it can be more difficult to incorporate new
languages into the language recognition process. This
problem will be addressed further in Section 3.4.

To make phone-recognition-based LID systems easier to
train, one can use a single-language phone recognizer as a
front end to a system that uses phonotactic scores to
perform LID. Phonotactics are the language-dependent set of
constraints specifying which phonemes are allowed to follow
other phonemes. For example, the German word “spiel”, which
is pronounced /sh p iy l/ and might be spelled in English
as “shpeel”, begins with a consonant cluster /sh p/ that
cannot occur in English (except if one word ends in /sh/
and the next begins with /p/, or in a compound word like
“flashpoint”). This approach is reminiscent of the work of
D’Amore [9, 21], Schmitt [42], and Damashek [8], who have
used n-gram analysis of text documents to perform language
and topic identification and clustering. By “tokenizing”
the speech message, i.e., converting the input waveform to
a sequence of phone symbols, the statistics of the
resulting symbol sequences can be used to perform language
identification. Hazen [15] and Zissman [51] each developed
LID systems that use one single-language front-end phone
recognizer. An important finding of these researchers was
that language ID could be performed successfully even when
the front-end phone recognizer was not trained on speech
spoken in the languages to be recognized. For example,
accurate Spanish vs. Japanese LID can be performed using
only an English phone recognizer. Zissman [51] and Yan [48]
have extended this work to systems containing multiple
single-language front ends, where there need not be a front
end in each language to be identified. Figure 4 shows an
example of these types of systems.
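The phonotactic scoring stage can be illustrated with a small
bigram model over phone symbols. This sketch assumes the front
end has already tokenized each utterance into a phone sequence
and starts from there. The add-alpha smoothing, the function
names, and the toy "German-like" and "English-like" training
strings are assumptions made for the example; the cited
systems may estimate and smooth their n-gram statistics
differently.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical start-of-utterance symbol

def train_bigram(phone_sequences):
    """Count bigrams and their left contexts for one language."""
    contexts, bigrams = Counter(), Counter()
    for seq in phone_sequences:
        padded = [START] + list(seq)
        contexts.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return contexts, bigrams

def log_likelihood(seq, model, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-likelihood of a phone sequence."""
    contexts, bigrams = model
    padded = [START] + list(seq)
    total = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        numerator = bigrams[(prev, cur)] + alpha
        denominator = contexts[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total

def identify(seq, models, vocab_size):
    """Return the language whose bigram model best explains the sequence."""
    return max(models,
               key=lambda lang: log_likelihood(seq, models[lang], vocab_size))

# Toy tokenized training data; the phone strings are invented.
training = {
    "german":  [["sh", "p", "iy", "l"], ["sh", "t", "aa", "t"]],
    "english": [["s", "p", "iy", "k"], ["s", "t", "aa", "p"]],
}
vocab = {p for seqs in training.values() for seq in seqs for p in seq}
models = {lang: train_bigram(seqs) for lang, seqs in training.items()}

# An unseen sequence that starts with the /sh p/ cluster discussed above.
hypothesis = identify(["sh", "p", "aa", "t"], models, len(vocab))
```

In a system with several front-end recognizers, one such set
of language models would be trained per recognizer and the
resulting scores combined.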

3.4. Using Multilingual Speech Units

Alternative approaches to training language-dependent
phoneme recognizers use multi-lingual speech units. These
are derived either from a mixture of language-dependent and
language-independent phones or by deriving tokens
automatically from training data. Advantages of this
approach include data sharing and discriminant training
between phonemes across languages and easy bootstrapping to
unseen languages [10].

Research has also focused on the problem of identifying and
processing only those phones that carry the most
language-discriminating information [1, 52]. These
language-dependent phones are called “mono-phonemes” or
“key-phones” in the literature. Kwan [24] and Dalsgaard [7]
use both language-specific and language-independent phones
in their systems. The language-independent phones,
sometimes called “poly-phones”, can be trained on data from
more than one language without loss of language ID
accuracy. Berkling [2] and Kohler [22, 23] have also tested
systems that use a single multi-language front-end phone
recognizer, i.e., a recognizer containing a mixture of
“poly-phones” and “mono-phones”.

3.5. Word Level Approaches

Between phone-level systems described in the previous
sections and the large-vocabulary speech recognition
systems described in a subsequent section are “word-level”
approaches to language ID. These systems use more
sophisticated sequence modeling than the phonotactic models
of the phone-level systems, but do not employ full
speech-to-text systems.

Kadambe [20] proposed the use of lexical modeling for
language identification. An incoming utterance is processed
by parallel language-dependent phone recognizers.
Hypothesized language-specific word occurrences are
identified from the resulting phone sequences. Each
language-dependent lexicon contains several thousand
entries. This is a bottom-up approach to the language ID
problem, where phones are recognized first, followed by
words, and eventually language. Thomas [46] has shown that
a language-dependent lexicon need not be available in
advance; rather, it can be learned automatically from the
training data. Ramesh [39], Matrouf [29], Lund [28, 27] and
Braun [3] have all proposed similar systems.

3.6. Continuous Speech Recognition

By adding even more knowledge to the system, researchers
hope to obtain even better LID performance. Mendoza [30],
Schultz [43, 44] and Hieronymus [16] have shown that
large-vocabulary continuous-speech recognition systems can
be used for language ID. During training, one speech
recognizer per language is created. During testing, each of
these recognizers is run in parallel, and the one yielding
output with highest likelihood is selected as the winning
recognizer; the language used to train that recognizer is
the hypothesized language of the utterance. Such systems
hold the promise of high-quality language identification,
because they use higher-level knowledge (words and word
sequences) rather than lower-level knowledge (phones and
phone sequences) to make the LID decision. Furthermore, one
obtains a transcription of the utterance as a byproduct of
LID. On the other hand, they require many hours of labeled
training data in each language to be recognized and are the
most computationally complex of the algorithms proposed.

4. EVALUATIONS

From 1993 to 1996, the National Institute of Standards and
Technology (NIST) of the U.S. Department of Commerce
sponsored formal evaluations of language ID systems. At
first, these evaluations were conducted using the Oregon
Graduate Institute Multi-Language Telephone Speech (OGI-TS)
Corpus [34]. The OGI-TS corpus contains 90 speech messages
in each of the following 11 languages: English, Farsi,
French, German, Hindi, Japanese, Korean, Mandarin, Spanish,
Tamil, and Vietnamese. Each message is spoken by a unique
speaker and comprises responses to ten prompts. For NIST
evaluations, the monologue speech evoked by the prompt
“Speak about any topic of your choice” is used for both
training and testing. No speaker speaks more than one
message or more than one language, and each speaker’s
message was spoken over a unique long-distance telephone
channel. Phonetically transcribed training data is
available for six of the OGI languages (English, German,
Hindi, Japanese, Mandarin and Spanish).

Figure 4: A LID system that uses several phone recognizers
in parallel.

Performance of the best systems from the 1993, 1994 and
1995 NIST evaluations is shown in Figure 5. This
performance represents each system’s first pass over the
evaluation data, which means that no system-tuning to the
evaluation data was possible. For utterances having
duration of either 45 s or 10 s, the best systems can
discriminate between two languages with 4% and 2% error,
respectively. This error rate is the average computed over
all language pairs with English, e.g., English vs. Farsi,
English vs. French, etc. When tested on nine-language
forced-choice classification, error rates of 12% and 23%
have been obtained on 45-s and 10-s utterances,
respectively. The syllabic-feature system developed by Li
and the systems with multiple phone recognizers followed by
phonotactic language modeling developed by Zissman and Yan
have exhibited the best performance over the years. Error
rate has decreased over time, which indicates that research
has improved system performance.

Starting in 1996, the NIST evaluations have employed the
CALLFRIEND corpus of the Linguistic Data Consortium.
CALLFRIEND comprises two-speaker, unprompted,
conversational speech messages between friends. One hundred
North American long-distance telephone conversations were
recorded in each of twelve languages (the same 11 languages
as OGI-TS plus Arabic). No speaker occurs in more than one
conversation. In the 1996 evaluation, the systems of Yan
and Zissman, which use multiple phone recognizers followed
by language modeling, performed best. The error rates on
30 s and 10 s utterances were 5% and 13% for pairwise
classification. These same systems obtained 23% and 46%
error rates for twelve-language classification. The higher
error rates on CALLFRIEND are due to the informal
conversational style of CALLFRIEND vs. the more formal
monologue style of OGI-TS.

The CSR-based LID systems have not been fully evaluated at
NIST evaluations, because orthographically and phonetically
labeled speech corpora have not been available in each of
the requisite languages. As such corpora become available
in more languages, implementation and evaluation of
CSR-based LID systems will become more feasible. Whether
the performance they will afford will be worth their
computational complexity remains to be seen.

5. CONCLUSIONS

Since the 1970s, language identification systems have
become more accurate and more complex. Current systems can
perform two-alternative forced-choice identification on
extemporaneous monologue almost perfectly, and these same
systems can perform 10-way identification with roughly 10%
error. Though error rates on conversational speech are
somewhat higher, there is every reason to believe that
continued research coupled with competitive evaluations
will result in improved system performance.

The improved performance of newer LID systems is due to
their use of higher levels of linguistic information.
Systems which try to model phones, phone frequencies, and
phonotactics naturally perform better than those that model
only lower-level acoustic information. Presumably, systems
that model words and grammars will be shown to have even
better accuracy.

Improved performance, however, comes at a cost. The higher
levels of linguistic information must be programmed or
trained into the newer LID systems. Whereas older systems
required only digitized speech samples in each language to
be recognized, more modern systems tend to require either a
phonetic or orthographic transcription of at least some of
the training utterances. State-of-the-art large-vocabulary
CSR systems are often trained on hundreds of hours of
transcribed speech. In recognition mode, these systems tend
to run tens or even hundreds of times slower than
real-time. Thus, the potential user of LID must balance the
need for accuracy against the need for speedy deployment
and low-cost (and possibly real-time) implementation.